{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "unknown-edwards",
   "metadata": {},
   "source": [
    "# Going Production: Auto-scale Hugging Face Transformer Endpoints with Amazon SageMaker\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "israeli-fetish",
   "metadata": {},
   "source": [
    "Welcome to this getting started guide, we will use the new Hugging Face Inference DLCs and Amazon SageMaker Python SDK to deploy a transformer model for real-time inference. \n",
    "In this example we are going to deploy a trained Hugging Face Transformer model on to SageMaker for inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"sagemaker>=2.66.2\" --upgrade"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "choice-steal",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'2.66.2.post0'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import sagemaker\n",
    "sagemaker.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adequate-strain",
   "metadata": {},
   "source": [
    "## Deploy one of the 15 000+ Hugging Face Transformers to Amazon SageMaker for Inference\n",
    "\n",
    "To deploy a model directly from the Hub to SageMaker we need to define 2 environment variables when creating the `HuggingFaceModel` . We need to define:\n",
    "\n",
    "- `HF_MODEL_ID`: defines the model id, which will be automatically loaded from [huggingface.co/models](http://huggingface.co/models) when creating or SageMaker Endpoint. The 🤗 Hub provides +10 000 models all available through this environment variable.\n",
    "- `HF_TASK`: defines the task for the used 🤗 Transformers pipeline. A full list of tasks can be find [here](https://huggingface.co/transformers/main_classes/pipelines.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.huggingface import HuggingFaceModel\n",
    "from uuid import uuid4\n",
    "import sagemaker \n",
    "\n",
    "role = sagemaker.get_execution_role()\n",
    "\n",
    "# Hub Model configuration. https://huggingface.co/models\n",
    "hub = {\n",
    "  'HF_MODEL_ID':'yiyanghkust/finbert-tone', # model_id from hf.co/models\n",
    "  'HF_TASK':'text-classification' # NLP task you want to use for predictions\n",
    "}\n",
    "\n",
    "# endpoint name\n",
    "endpoint_name=f'{hub[\"HF_MODEL_ID\"].split(\"/\")[1]}-{str(uuid4())}' # model and endpoint name\n",
    "\n",
    "# create Hugging Face Model Class\n",
    "huggingface_model = HuggingFaceModel(\n",
    "   env=hub,\n",
    "   role=role, # iam role with permissions to create an Endpoint\n",
    "   name=endpoint_name, # model and endpoint name\n",
    "   transformers_version=\"4.11\", # transformers version used\n",
    "   pytorch_version=\"1.9\", # pytorch version used\n",
    "   py_version=\"py38\", # python version of the DLC\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "nonprofit-elephant",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-----!"
     ]
    }
   ],
   "source": [
    "# deploy model to SageMaker Inference\n",
    "predictor = huggingface_model.deploy(\n",
    "   initial_instance_count=1,\n",
    "   instance_type=\"ml.c5.large\"\n",
    ")\n",
    "# get aws region for dashboards\n",
    "aws_region = predictor.sagemaker_session.boto_region_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "noted-explosion",
   "metadata": {},
   "source": [
    "**Architecture**\n",
    "\n",
    "The [Hugging Face Inference Toolkit for SageMaker](https://github.com/aws/sagemaker-huggingface-inference-toolkit) is an open-source library for serving Hugging Face transformer models on SageMaker. It utilizes the SageMaker Inference Toolkit for starting up the model server, which is responsible for handling inference requests. The SageMaker Inference Toolkit uses [Multi Model Server (MMS)](https://github.com/awslabs/multi-model-server) for serving ML models. It bootstraps MMS with a configuration and settings that make it compatible with SageMaker and allow you to adjust important performance parameters, such as the number of workers per model, depending on the needs of your scenario.\n",
    "\n",
    "![](./imgs/hf-inference-toolkit.png)\n",
    "\n",
    "**Deploying a model using SageMaker hosting services is a three-step process:**\n",
    "\n",
    "1. **Create a model in SageMaker** —By creating a model, you tell SageMaker where it can find the model components. \n",
    "2. **Create an endpoint configuration for an HTTPS endpoint** —You specify the name of one or more models in production variants and the ML compute instances that you want SageMaker to launch to host each production variant.\n",
    "3. **Create an HTTPS endpoint** —Provide the endpoint configuration to SageMaker. The service launches the ML compute instances and deploys the model or models as specified in the configuration\n",
    "\n",
    "![](./imgs/sm-endpoint.png)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "enclosed-amino",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'label': 'negative', 'score': 0.9870443940162659}]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# example request, you always need to define \"inputs\"\n",
    "data = {\n",
    "   \"inputs\": \"There is a shortage of capital for project SageMaker. We need extra financing\"\n",
    "}\n",
    "\n",
    "# request\n",
    "predictor.predict(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "another-gardening",
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in range(500):\n",
    "    predictor.predict(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "stylish-witch",
   "metadata": {},
   "source": [
    "## Model Monitoring\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"https://console.aws.amazon.com/cloudwatch/home?region={aws_region}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'finbert-tone-73d26f97-9376-4b3f-9334-a2-2021-10-29-12-18-52-365~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~start~'-PT15M~end~'P0D~region~'{aws_region}~stat~'SampleCount~period~30);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{predictor.endpoint_name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![model-monitoring-dashboard](./imgs/model-monitoring-dashboard.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Auto Scaling your Model\n",
    "\n",
    "[Amazon SageMaker](https://aws.amazon.com/sagemaker/) is a fully managed service that provides every developer and data scientist with the ability to quickly build, train, and deploy machine learning (ML) models at scale.\n",
    "\n",
    "Autoscaling is an out-of-the-box feature that monitors your workloads and dynamically adjusts the capacity to maintain steady and predictable performance at the possible lowest cost.\n",
    "\n",
    "The following diagram is a sample architecture that showcases how a model is served as a endpoint with autoscaling enabled.\n",
    "\n",
    "\n",
    "\n",
    "![autoscaling-endpoint](./imgs/autoscaling-endpoint.png)\n",
    "\n",
    "\n",
    "### Reference Blog post [Configuring autoscaling inference endpoints in Amazon SageMaker](https://aws.amazon.com/de/blogs/machine-learning/configuring-autoscaling-inference-endpoints-in-amazon-sagemaker/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configure Autoscaling for our Endpoint\n",
    "\n",
    "You can define minimum, desired, and maximum number of instances per endpoint and, based on the autoscaling configurations, instances are managed dynamically. The following diagram illustrates this architecture. \n",
    "\n",
    "![scaling-options](./imgs/scaling-options.jpeg)\n",
    "\n",
    "AWS offers many different [ways to auto-scale your endpoints](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking.html). One of them Simple-Scaling, where you scale the instance capacity based on `CPUUtilization` of the instances or `SageMakerVariantInvocationsPerInstance`. \n",
    "\n",
    "In this example we are going to use `SageMakerVariantInvocationsPerInstance` to auto-scale our Endpoint\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "\n",
    "# Let us define a client to play with autoscaling options\n",
    "asg_client = boto3.client('application-autoscaling') # Common class representing Application Auto Scaling for SageMaker amongst other services\n",
    "\n",
    "# he resource type is variant and the unique identifier is the resource ID.\n",
    "# Example: endpoint/my-bert-fine-tuned/variant/AllTraffic .\n",
    "resource_id=f\"endpoint/{predictor.endpoint_name}/variant/AllTraffic\"\n",
    "\n",
    "# scaling configuration\n",
    "response = asg_client.register_scalable_target(\n",
    "    ServiceNamespace='sagemaker', #\n",
    "    ResourceId=resource_id,\n",
    "    ScalableDimension='sagemaker:variant:DesiredInstanceCount', \n",
    "    MinCapacity=1,\n",
    "    MaxCapacity=4\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create Scaling Policy with configuration details, e.g. `TargetValue` when the instance should be scaled."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = asg_client.put_scaling_policy(\n",
    "    PolicyName=f'Request-ScalingPolicy-{predictor.endpoint_name}',\n",
    "    ServiceNamespace='sagemaker',\n",
    "    ResourceId=resource_id,\n",
    "    ScalableDimension='sagemaker:variant:DesiredInstanceCount',\n",
    "    PolicyType='TargetTrackingScaling',\n",
    "    TargetTrackingScalingPolicyConfiguration={\n",
    "        'TargetValue': 10.0, # Threshold\n",
    "        'PredefinedMetricSpecification': {\n",
    "            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',\n",
    "        },\n",
    "        'ScaleInCooldown': 300, # duration until scale in\n",
    "        'ScaleOutCooldown': 60 # duration between scale out\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "stress test the endpoint with threaded requests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "request_duration_in_seconds = 4*65\n",
    "end_time = time.time() + request_duration_in_seconds\n",
    "\n",
    "print(f\"test will run {request_duration_in_seconds} seconds\")\n",
    "\n",
    "while time.time() < end_time:\n",
    "    predictor.predict(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Monitor the `InvocationsPerInstance` in cloudwatch "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"https://console.aws.amazon.com/cloudwatch/home?region={aws_region}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'InvocationsPerInstance~'EndpointName~'{predictor.endpoint_name}~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'{aws_region}~start~'-PT15M~end~'P0D~stat~'SampleCount~period~60);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{predictor.endpoint_name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "check the endpoint instance_count number"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Endpoint finbert-tone-73d26f97-9376-4b3f-9334-a2-2021-10-29-12-18-52-365 has \n",
      "Current Instance Count: 4\n",
      "With a desired instance count of 4\n"
     ]
    }
   ],
   "source": [
    "bt_sm = boto3.client('sagemaker')\n",
    "response = bt_sm.describe_endpoint(EndpointName=predictor.endpoint_name)\n",
    "print(f\"Endpoint {response['EndpointName']} has \\nCurrent Instance Count: {response['ProductionVariants'][0]['CurrentInstanceCount']}\\nWith a desired instance count of {response['ProductionVariants'][0]['DesiredInstanceCount']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "flexible-trance",
   "metadata": {},
   "source": [
    "## Clean up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "failing-meaning",
   "metadata": {},
   "outputs": [],
   "source": [
    "# delete endpoint\n",
    "predictor.delete_endpoint()"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "interpreter": {
   "hash": "ec1370a512a4612a2908be3c3c8b0de1730d00dc30104daff827065aeaf438b7"
  },
  "kernelspec": {
   "display_name": "Python 3.8.5 64-bit ('hf': conda)",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
